Subtle adversarial image manipulations influence both human and machine perception

Although artificial neural networks (ANNs) were inspired by the brain, ANNs exhibit a brittleness not generally observed in human perception. One shortcoming of ANNs is their susceptibility to adversarial perturbations—subtle modulations of natural images that result in changes to classification decisions, such as confidently mislabelling an image of an elephant, initially classified correctly, as a clock. In contrast, a human observer might well dismiss the perturbations as an innocuous imaging artifact. This phenomenon may point to a fundamental difference between human and machine perception, but it drives one to ask whether human sensitivity to adversarial perturbations might be revealed with appropriate behavioral measures. Here, we find that adversarial perturbations that fool ANNs similarly bias human choice. We further show that the effect is more likely driven by higher-order statistics of natural images to which both humans and ANNs are sensitive, rather than by the detailed architecture of the ANN.


Retinal blurring layer:
Let $d_{\mathrm{viewer}}$ be the distance (in meters) of the viewer from the display and $d_{\mathrm{hw}}$ be the height and width of a square image (in meters). For every spatial position (in meters) $c = (x, y) \in \mathbb{R}^2$ in the image we compute the retinal eccentricity $\theta(c)$ (in radians) as follows: […] and turn this into a target resolution in units of radians, $r_{\mathrm{rad}}(c) = \min\left(\alpha\,\theta(c),\ \text{[…]}\right)$.
We then turn this target resolution into a target spatial resolution in the plane of the screen, $r_m(c) = r_{\mathrm{rad}}(c)\left(1 + \tan^2\theta(c)\right)$, $r_{\mathrm{pixel}}(c) = r_m(c) \cdot [\text{pixels per meter}]$.
This spatial resolution for two-point discrimination is then converted into a corresponding low-pass cutoff frequency $f(c)$, in units of cycles per pixel, […] where the numerator is $\pi$ rather than $2\pi$ since the two-point discrimination distance $r_{\mathrm{pixel}}$ is half the wavelength. Finally, this target low-pass spatial frequency $f(c)$ for each pixel is used to linearly interpolate each pixel value from the corresponding pixel in a set of low-pass filtered images, as described in Algorithm 1 (all operations on matrices are assumed to be performed elementwise). We additionally cropped $X_{\mathrm{retinal}}$ to 90% width before use, to remove artifacts from the image edge.
Note that because the per-pixel blurring is performed using linear interpolation into images that were low-pass filtered in Fourier space, this transformation is both fast to compute and fully differentiable.
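The chain from screen position to target cutoff frequency can be sketched for a single position as follows. This is a minimal sketch, not the authors' implementation. The eccentricity formula (arctangent of the distance from the image center over the viewing distance), the name `beta` for the second argument of the min, and the form π / r_pixel for the cutoff are assumptions, since the corresponding expressions are not recoverable from the text. All parameter values in the usage lines are hypothetical.

```python
import math

def target_frequency(x, y, d_viewer, pixels_per_meter, alpha, beta):
    """Target low-pass cutoff at screen position (x, y), in meters from the image center."""
    # ASSUMPTION: eccentricity is the angle subtended at the eye by the offset from center.
    theta = math.atan(math.hypot(x, y) / d_viewer)
    # ASSUMPTION: `beta` is a hypothetical name for the cap on the angular resolution.
    r_rad = min(alpha * theta, beta)
    # Resolution in the plane of the screen, as written in the text.
    r_m = r_rad * (1 + math.tan(theta) ** 2)
    r_pixel = r_m * pixels_per_meter
    if r_pixel == 0:
        return math.inf  # no blur at the exact center
    # ASSUMPTION: numerator pi, denominator the two-point discrimination distance.
    return math.pi / r_pixel

# Hypothetical viewing geometry: 0.6 m viewing distance, 4000 pixels per meter.
f_near = target_frequency(0.01, 0.0, 0.6, 4000, alpha=0.1, beta=0.05)
f_far = target_frequency(0.05, 0.0, 0.6, 4000, alpha=0.1, beta=0.05)
```

Under these assumptions the cutoff falls with eccentricity, so positions further from fixation receive stronger blur.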
Algorithm 1: Applying retinal blur to an image
  X_img: input image
  F: image containing the corresponding target low-pass frequency for each pixel
  G: norm of spatial frequency at each position in Y
  CUTOFF_FREQS: list of frequencies to use as cutoffs for low-pass filtering
  […]
  end for
  w(c): linear interpolation coefficients for F(c) into CUTOFF_FREQS, ∀c
  X_retinal(c): […]
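The per-pixel interpolation between low-pass filtered copies of the image can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the Gaussian filter shape, the use of cycles per pixel for the frequency norm, the clipping of the target map to the cutoff range, and the function name `retinal_blur` are all assumptions, because the loop body of the algorithm is not recoverable from the text.

```python
import numpy as np

def retinal_blur(x_img, target_freq, cutoff_freqs):
    """Blend low-pass filtered copies of a 2-D image using a per-pixel cutoff map."""
    cutoff_freqs = np.sort(np.asarray(cutoff_freqs, dtype=float))  # needs >= 2 values
    h, w = x_img.shape
    # G: norm of the spatial frequency (cycles per pixel) at each FFT position.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    G = np.hypot(fy, fx)
    spectrum = np.fft.fft2(x_img)
    # One filtered image per cutoff. ASSUMPTION: Gaussian-shaped low-pass filter.
    filtered = np.stack([
        np.fft.ifft2(spectrum * np.exp(-(G / f) ** 2)).real
        for f in cutoff_freqs
    ])
    # Linear interpolation coefficients for F(c) into CUTOFF_FREQS.
    F = np.clip(target_freq, cutoff_freqs[0], cutoff_freqs[-1])
    hi = np.clip(np.searchsorted(cutoff_freqs, F), 1, len(cutoff_freqs) - 1)
    lo = hi - 1
    t = (F - cutoff_freqs[lo]) / (cutoff_freqs[hi] - cutoff_freqs[lo])
    rows, cols = np.indices(x_img.shape)
    return (1 - t) * filtered[lo, rows, cols] + t * filtered[hi, rows, cols]
```

Every step is a linear or elementwise operation on the image, which is consistent with the remark that the transformation is differentiable; an automatic-differentiation framework would be needed to actually obtain gradients.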

B Supplementary Note 2
B.1 Per class analysis
In addition to the overall ANOVA with ε and target class as independent variables, we conducted ANOVAs for each target class separately. Experiments 2 and 3 each had four target classes (dog, cat, bottle, bird), tested in a between-participant design. Experiment 4 had four target-class pairs (cat-truck, sheep-chair, dog-bottle, elephant-clock), also tested in a between-participant design. For each analysis by class (or class pair), there were approximately 100 participants (see Table Supp.18 for exact counts). For every target class, we find a preference for the adversarial image that is reliably above chance (Table Supp.2). Thus, over the set of classes that we chose in advance, the neural network predicted human preferences for every one. Additionally, for about half the target classes, we find a reliable effect of ε (Table Supp.3). We further bolstered this analysis with a more targeted test that focuses on the correlation between perturbation magnitude and human perceptual bias. We computed the Spearman correlation between ε and perceptual bias for each participant in each condition and tested whether the correlation was reliably positive. The correlation was significantly positive for 8 of the 12 conditions (Table Supp.4).
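The per-participant correlation analysis can be illustrated with a small standard-library sketch. The perturbation levels and bias values below are hypothetical. The one-sample t statistic against zero is also an assumption, since the text does not state which test was used to decide that the correlations were reliably positive.

```python
import math
from statistics import mean, stdev

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (undefined if y is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def t_against_zero(correlations):
    """One-sample t statistic for the mean correlation (ASSUMED test, not from the text)."""
    n = len(correlations)
    return mean(correlations) / (stdev(correlations) / math.sqrt(n))

EPSILONS = [2, 4, 8, 16]  # hypothetical perturbation magnitudes
participants = [          # hypothetical proportion of adversarial choices per epsilon
    [0.50, 0.52, 0.55, 0.60],
    [0.48, 0.50, 0.50, 0.56],
    [0.55, 0.51, 0.53, 0.58],
]
rhos = [spearman(EPSILONS, bias) for bias in participants]
t_stat = t_against_zero(rhos)
```

Computing one correlation per participant and then testing the set of correlations keeps the participant as the unit of analysis, which matches the description above.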

B.2 MS-SSIM and perceptual bias analysis
In this section, we examine the relationship between the multi-scale structural similarity (MS-SSIM) [36] and the perceptual bias induced by an adversarial image (i.e., the probability above chance that participants chose the adversarial image). MS-SSIM measures the perceptual similarity of an original image and a perturbed image. Thus, MS-SSIM is a measure of the saliency of the perturbations and is also a good proxy for the homogeneity of the image. Since MS-SSIM compares original and perturbed images, it is also necessarily sensitive to the magnitude of the adversarial perturbations. For the same set of original images, we compute the correlation between the perceptual bias and the MS-SSIM metric computed for adversarial images with perturbations of the same magnitude ε. We compute the correlation individually for each value of ε to focus on the effect of the homogeneity of the original image on perceptual bias rather than the effect of perturbation magnitude. At all perturbation magnitudes, we find no evidence that participants are more sensitive to perturbations that are highly salient (e.g., textures painted into a uniform background such as the sky) than to ones that are less salient (Figure Supp.11).

C Supplementary Note 3
Experiment SI-4: 10-way classification
We conducted an additional experiment in which we asked participants to classify adversarially perturbed images. Participants chose a response class from 10 alternatives, one of which was always the true image class (T), one was the adversarial class (A), and one was an alternative class (A′). This experiment included pairs of perturbed images that came from the same source image, one warped from target class T to adversarial class A (T → A) and one to a different class A′ (T → A′), all with ε = 16. If an adversarial perturbation is affecting the ostensible image class, then one would expect many A responses for a T → A image but few A responses for a T → A′ image and vice versa. The difference in T → A and T → A′ error rates is a measure of misclassification corrected for intrinsic ambiguity in the image. We identified {T → A, T → A′} pairs in which each image had been shown to at least three participants. (Fewer than three participants would not provide a meaningful estimate of error rate.) With 310 such pairs, we see a higher response rate for A with T → A vs. T → A′ (1.6% vs. 0.9%), but this difference is not statistically reliable (t(309) = 1.83, p = 0.069, Cohen's d = 0.11, 95% Confidence Interval of difference between means = [0.0, 0.2]). A better-controlled study would likely have found a difference, so we further investigated the nature of the manipulated images that yielded high error rates. Figure Supp.12 shows the 10 T → A images with the highest corrected misclassification rate along with the corresponding original image.
Note that (1) only a few images are responsible for the difference in error rates, (2) some images are intrinsically ambiguous or have features in the original image suggestive of class A (e.g., the evidence for a person sitting in the A = chair images), and (3) there is little evidence that the adversarial perturbations "paint in" a canonical instance of the adversarial class. Because of (2) and (3), none of the images "look like" adversarial class A, and so we suspect that there would be few if any A responses if participants were not making a forced-choice selection. To address the possibility that such a small number of images were responsible for the main effects we find, we identified 25 images in the dataset for which even weak evidence existed of potential misclassification (i.e., given a pair of images in which original class T was perturbed to classes A and A′, participants made even one more misclassification of A on T → A than on T → A′). We then repeated our ANOVA for main Exp. 4 after first excluding those 25 images. The results of this ANOVA were unchanged from those reported in the main paper for Exp. 4: we obtain a main effect of class (F(3, 385) = 15.7, p < 0.001, partial η² = 0.11) and a main effect of ε (F(3, 1155) = 19.1, p < 0.001, partial η² = 0.05). Numerically, the condition means look quite similar, suggesting that these misperceived images are unlikely to have driven the effect in Experiment 4. We report the 95% Confidence Interval of perceptual bias for the above ANOVA analysis in Table Supp.10.
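The corrected misclassification measure used in this note, and the t statistic on its per-pair values, can be sketched as follows. The data layout (a tuple of two response lists and the adversarial class label) and all function names are hypothetical, and the toy responses are invented for illustration only.

```python
import math
from statistics import mean, stdev

def a_response_rate(responses, adversarial_class):
    """Fraction of responses that named the adversarial class A."""
    return sum(r == adversarial_class for r in responses) / len(responses)

def corrected_misclassification(pairs, min_participants=3):
    """Per-pair difference in A response rate between the T->A and T->A' images.

    pairs: list of (responses_to_T_to_A, responses_to_T_to_Aprime, class_A).
    Pairs in which either image has fewer than `min_participants` responses are skipped.
    """
    diffs = []
    for resp_a, resp_alt, adv in pairs:
        if len(resp_a) < min_participants or len(resp_alt) < min_participants:
            continue
        diffs.append(a_response_rate(resp_a, adv) - a_response_rate(resp_alt, adv))
    return diffs

def paired_t(diffs):
    """t statistic and degrees of freedom for the mean per-pair difference."""
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

toy_pairs = [  # invented responses, for illustration only
    (["cat", "dog", "cat"], ["dog", "dog", "dog"], "cat"),
    (["dog", "dog"], ["dog", "dog", "dog"], "cat"),  # skipped: too few responses
    (["dog", "dog", "dog", "cat"], ["dog", "cat", "dog", "dog"], "cat"),
]
diffs = corrected_misclassification(toy_pairs)
t_stat, dof = paired_t(diffs)
```

With 310 retained pairs, a test of this form has 309 degrees of freedom, in line with the reported t(309).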

D Supplementary Note 4
In order to validate the assumption of equal variance across comparison groups in Experiments 1-5, we calculated the variances of the data and found that they were roughly uniform (Table Supp.11). Due to a floor effect inherent in our experiment design, where a systematic bias below 0.5 cannot occur (since random choice is at 0.5), we expected and observed a non-normal yet unimodal distribution of the dependent variable's residuals in some of the experiments (Figure Supp.13 and Table Supp.13, where we used the Shapiro-Wilk test to assess the normality of the residuals). However, the mixed-ANOVA analyses we performed are known to be robust to the dependent variable violating the normality assumption. To further confirm that our results are not due to chance, we conducted a nonparametric shuffling analysis on the data from Experiments 1-4. For each experiment, we randomly reassigned the factor-level labels and reanalyzed the data. The shuffling procedure eliminated all the effects we observed in our unshuffled data (Table Supp.7). The F score we obtained for the unshuffled data was an extreme outlier in each case, suggesting that the effects of perturbation magnitude and object category that we observed are robust and highly unlikely to occur by chance (Table Supp.8).
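The logic of the shuffling analysis can be illustrated with a simplified sketch. It uses a one-way between-groups F statistic in place of the mixed ANOVA reported here, which is a deliberate simplification, and the data, labels, and function names are hypothetical.

```python
import random

def one_way_f(values, labels):
    """One-way ANOVA F statistic for values grouped by label."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def shuffle_test(values, labels, n_shuffles=1000, seed=0):
    """Compare the observed F with the F values obtained after shuffling the labels."""
    rng = random.Random(seed)
    observed = one_way_f(values, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if one_way_f(values, shuffled) >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (exceed + 1) / (n_shuffles + 1)

# Hypothetical perceptual-bias scores at a low and a high perturbation level.
values = [0.50, 0.51, 0.49, 0.52, 0.60, 0.61, 0.59, 0.62]
labels = ["low"] * 4 + ["high"] * 4
f_observed, p_value = shuffle_test(values, labels)
```

If the observed F lies far in the tail of the shuffled distribution, as in this toy example, the effect is unlikely to arise from the label assignment alone.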

E Supplementary Note 5
COCO images displayed in this manuscript are covered by the Creative Commons Attribution 2.0 license (CC BY 2.0). Below we list all sources for these images.

F Supplementary Tables
[…]   1.0 ± 0.00   1.0 ± 0.00   1.0 ± 0.00   1.0 ± 0.00
E3    0.95 ± 0.01  0.99 ± 0.01  1.0 ± 0.00   1.0 ± 0.00
E4    0.85 ± 0.01  0.97 ± 0.01  1.0 ± 0.00   1.0 ± 0.00

[…] and convolution adversarial images. We compared the distribution of the mean pixel intensity (A) and standard deviation of the pixel intensity (B) of adversarial images created using the self-attention and convolutional ANN. We could not find a significant difference between the mean pixel intensity distributions of self-attention and convolutional adversarial images (Kolmogorov-Smirnov test, Z = 0.007, p = 0.99, effect size D = 0.0005). Likewise, we could not find a significant difference between their standard deviation distributions (Kolmogorov-Smirnov test, Z = 0.019, p = 0.99, effect size D = 0.001). These results suggest that our observed human perceptual effects cannot be explained by low-level image statistics related to luminance and contrast.

[…] Figure 2c). Both the convolutional and self-attention ANNs produce adversarial images that reliably bias participants; averaging across target classes and ε, the bias (deviation from chance) for convolutional ANNs is 1.51% (t(380) = 3.91, p < 0.001, Cohen's d = 0.2, 95% CI of bias = [0.01, 0.02], two-sided t-test) and for self-attention ANNs is 2.52% (t(380) = 5.98, p < 0.001, Cohen's d = 0.31, 95% CI of bias = [0.02, 0.03], two-sided t-test), but we do not find credible evidence that these biases are different (paired t(379) = 1.87, p = 0.062, Cohen's d = 0.13, 95% CI of difference between means = [0.0, 0.2], two-tailed t-test). We also conducted a three-way ANOVA with the factors class pair, ε, and model (convolutional vs. self-attention). Other than the overall bias being reliably nonzero (F(1, 377) = 45.2, p < 0.001, partial η² = 0.107, 95% CI of mean bias = [0.01, 0.03]), all other effects and interactions were non-significant, including main effects of ε (F(1, 377) = 3.61, p = 0.058, partial η² = 0.009) and target classes (F(3, 377) = 2.43, p = 0.065, partial η² = 0.019).
We therefore consider this experiment inconclusive in determining the relative effectiveness of convolutional and self-attention models.

Each trial consists of a pair of images, one adversarially perturbed toward the cat class and one with the same perturbation only flipped left-right. In a replication condition, we ask participants which image is more cat-like. In a second condition, we ask which image is more distorted. And in a third condition, we ask which image is more like a truck (i.e., a class which is different from the adversarial class). The three conditions were between participant, with n = 50 independent participants per condition. Example image is drawn from a collection of images from the Microsoft COCO dataset [62] and OpenImages dataset [63]. (b) Box plots (same convention as Figure 2c) showing that both the 'distorted' condition and the 'cat' condition obtained a reliable bias (from n = 50 independent participants) in selecting the adversarial image (distorted: t(49) = 6.241, p < 0.001, Cohen's d = 0.88, 95% CI of bias = [0.09, 0.17], two-tailed t-test; cat: t(49) = 6.33, p < 0.001, Cohen's d = 0.9, 95% CI of bias = [0.1, 0.18], two-tailed t-test). However, there was no reliable difference between the two conditions (t(49) = 0.439, p = 0.66, Cohen's d = 0.09, 95% CI of bias = [-0.05, 0.07], two-tailed). In contrast, we find no credible evidence that the 'truck' condition produced any deviation from chance (t(49) = 0.545, p = 0.588, Cohen's d = 0.077, 95% CI of bias = [-0.02, 0.04], two-tailed), and the 'cat' bias was larger than the 'truck' bias (t(49) = 4.984, p < 0.001, Cohen's d = 0.996, 95% CI of the difference between their means is computed as greater than 0.09, one-tailed t-test). Thus, although the experiment shows that choice based on image distortion could produce a bias as large as we observed for the ε = 16 'cat' condition of Experiment 3, in actuality participants were not choosing based on distortion.
Were participants choosing based on distortion, the adversarial class would be irrelevant, and one would expect no difference between the 'cat' and 'truck' conditions, whereas we observe a reliable difference here.

For each scatter plot, we use the same set of the original images. For each perturbation magnitude, we find no credible evidence that participants are more sensitive to perturbations that are highly salient (e.g., textures painted into a uniform background such as the sky) than ones that are less salient, as measured

We computed residuals corresponding to the main effect and the effect of independent factors for each experiment; these residuals are shown in the above plot. We note that the distribution of residuals in each case is unimodal. We report statistics for the Shapiro-Wilk test of normality of each of the above residuals in Table Supp.13.